Comparison of methods for imputing limited-range variables: a simulation study
Authors
Abstract
BACKGROUND Multiple imputation (MI) was developed to enable valid inferences in the presence of missing data rather than to re-create the missing values. In applied settings, it remains unclear how important it is that imputed values be plausible for individual observations. One variable type for which MI may produce implausible values is a limited-range variable, where imputed values may fall outside the observable range. The aim of this work was to compare methods for imputing limited-range variables, with a focus on those that restrict the range of the imputed values.
METHODS Using data from a study of adolescent health, we considered three variables based on responses to the General Health Questionnaire (GHQ), a tool for detecting minor psychiatric illness. These variables, based on different scoring methods for the GHQ, resulted in three continuous distributions with mild, moderate and severe positive skewness. In an otherwise complete dataset, we set 33% of the GHQ observations to missing completely at random or missing at random, repeating this process to create 1000 datasets with incomplete data for each scenario. For each dataset, we imputed values on the raw scale and following a zero-skewness log transformation using: univariate regression with no rounding; post-imputation rounding; truncated normal regression; and predictive mean matching. We estimated the marginal mean of the GHQ and the association between the GHQ and a fully observed binary outcome, comparing the results with complete data statistics.
RESULTS Imputation with no rounding performed well when applied to data on the raw scale. Post-imputation rounding and imputation using truncated normal regression produced higher marginal means than the complete data estimate when data had a moderate or severe skew, and this was associated with under-coverage of the complete data estimate. Predictive mean matching also produced under-coverage of the complete data estimate. For the estimate of association, all methods produced estimates similar to those from the complete data.
CONCLUSIONS For data with a limited range, multiple imputation using techniques that restrict the range of imputed values can result in biased estimates of the marginal mean when data are highly skewed.
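The effect described in the results can be illustrated with a small sketch. It is not the study's procedure: the data are simulated from a hypothetical skewed distribution on an assumed range of 0 to 12, there are no covariates, a single set of imputed values is drawn without parameter uncertainty (so this is not proper MI), and "post-imputation rounding" is taken here to mean setting out-of-range draws to the nearest bound. The truncated draws use the naively fitted normal parameters rather than a fitted truncated regression model. All function names are illustrative.

```python
import numpy as np


def simulate_skewed_scores(n, low, high, rng):
    """Hypothetical stand-in for a positively skewed, limited-range score."""
    raw = rng.gamma(shape=1.0, scale=2.0, size=n)
    return np.clip(raw, low, high)


def impute_normal(observed, n_missing, rng):
    """Draw from a normal fitted to the observed values (no range restriction)."""
    return rng.normal(observed.mean(), observed.std(ddof=1), size=n_missing)


def impute_rounded(unrestricted_draws, low, high):
    """Set out-of-range draws to the nearest bound (assumed meaning of rounding)."""
    return np.clip(unrestricted_draws, low, high)


def impute_truncated(observed, n_missing, low, high, rng):
    """Normal draws restricted to [low, high] by rejection sampling."""
    mu, sd = observed.mean(), observed.std(ddof=1)
    draws = np.empty(0)
    while draws.size < n_missing:
        candidates = rng.normal(mu, sd, size=2 * n_missing)
        accepted = candidates[(candidates >= low) & (candidates <= high)]
        draws = np.concatenate([draws, accepted])
    return draws[:n_missing]


def compare_methods(n=2000, low=0.0, high=12.0, frac_missing=0.33, seed=0):
    """Return the marginal mean under each approach for one simulated dataset."""
    rng = np.random.default_rng(seed)
    full = simulate_skewed_scores(n, low, high, rng)
    missing = rng.random(n) < frac_missing  # missing completely at random
    observed = full[~missing]
    n_missing = int(missing.sum())

    unrestricted = impute_normal(observed, n_missing, rng)
    imputed = {
        "no_rounding": unrestricted,
        "rounded": impute_rounded(unrestricted, low, high),
        "truncated": impute_truncated(observed, n_missing, low, high, rng),
    }
    means = {"complete": float(full.mean())}
    for name, values in imputed.items():
        means[name] = float(np.concatenate([observed, values]).mean())
    return means


if __name__ == "__main__":
    for method, mean in compare_methods().items():
        print(f"{method:12s} {mean:.3f}")
```

In this setup a symmetric normal fitted to skewed data places a sizeable share of draws below the lower bound. Moving or redrawing those values into range raises the imputed mean, which is one plausible mechanism for the upward bias reported for the range-restricting methods.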
Similar resources
Generalized Family of Estimators for Imputing Scrambled Responses
When there is a high correlation between the study and the auxiliary variables, the rank of the auxiliary variable also correlates with the study variable. Then, the use of the rank as an additional auxiliary variable may be helpful to increase the efficiency of the estimator of the mean or total of the population. In the present study, we propose two generalized familie...
Evaluation and comparison of performance of SDSM and CLIMGEN models in simulation of climatic variables in Qazvin plain
Climate change is considered the most important global issue of the 21st century, so monitoring its trend is of great importance. Because of their large-scale computational grids, atmospheric general circulation models cannot predict climatic parameters at a point scale, so small-scale methods should be adopted. Among downscaling methods, statistical methods are used as they are easy to...
Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study
Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be...
Multiple imputation for IPD meta‐analysis: allowing for heterogeneity and studies with missing covariates
Recently, multiple imputation has been proposed as a tool for individual patient data meta-analysis with sporadically missing observations, and it has been suggested that within-study imputation is usually preferable. However, such within-study imputation cannot handle variables that are completely missing within studies. Further, if some of the contributing studies are relatively small, it may...
Numerical simulation of in-cylinder tumble flow field measurements and comparison to experimental results
This paper presents a comparison between measured and predicted results of the in-cylinder tumble flow and the flow coefficient generated by a port-valve-liner assembly on a steady-flow test bench. In this study, computational fluid dynamics (CFD) methods were employed to gain further insight into characteristics of an engine. The purpose was to advance understanding of the stationary turbulenc...